RNA secondary structure factorization in prime tangles

Background Due to their key role in various biological processes, RNA secondary structures have always been the focus of in-depth analyses, with great efforts from mathematicians and biologists to find a suitable abstract representation for modelling their functional and structural properties. One contribution is due to Kauffman and Magarshak, who modelled RNA secondary structures as mathematical objects constructed in link theory: tangles of the Brauer Monoid. In this paper, we extend the tangle-based model with its minimal prime factorization, which is useful for analyzing the patterns that characterize an RNA secondary structure.
Results By leveraging the mapping between RNA and tangles, we prove that the prime factorizations of tangle-based models share some patterns with the features of RNA folding. We analyze the E. coli tRNA and provide some visual examples of interesting patterns.
Conclusions We formulate an open question on the nature of the class of equivalent factorizations and discuss some research directions in this regard. We also propose some practical applications of the tangle-based method to RNA classification and folding prediction as a useful tool for learning algorithms, even when the full factorization is not known.

(G-U). Figure 1 shows a primary and a secondary structure along with its dot-bracket notation, a string in which a pair of matching brackets corresponds to a weak bond in the secondary structure and dots denote unpaired nucleotides. The dot-bracket string can also be represented by a flattened diagram, that is, a set of points displayed horizontally (representing the nucleotides) in which each pair is joined by an arc in the upper half of the diagram. Since every arc has to connect two dots, a flattened diagram with N arcs has 2N paired dots.
Depending on the bonds present in the secondary structure, different types of brackets may be needed to avoid ambiguity. The folding process gives rise to some interesting structural features (loops) that can be categorized as hairpins, bulges, stems, interior loops (see Fig. 2), and multiloops (see Fig. 3).
It is often the case that RNA secondary structures form a pseudoknot, where an unbonded nucleotide is bonded with another nucleotide in a different loop of the RNA molecule (Fig. 3). Predicting the optimal structure with pseudoknots during the folding process, also known as the RNA folding problem, often requires a prohibitive amount of time. Although great efforts have been made to solve this problem, both from an algebraic [2,19,20,22] and a machine learning perspective [25], there is still room for improvement.
Due to the pivotal role of RNA in biological processes, the study of its secondary structures is of great importance. The process of protein production is the result of the interaction of three types of RNA: transfer RNA, ribosomal RNA, and messenger RNA. Viruses have evolved to inject their genome (in the form of RNA) into the host cells in order to replicate themselves. Moreover, it is still debated whether the self-replicating capabilities of RNA may have provided the basis for early life on Earth even before DNA appeared (RNA World Hypothesis [11,14]).
Fig. 1 RNA structures, dot-bracket notation and flattened diagram. Example of an RNA found in Mus musculus (house mouse) [18]. Its primary structure is on the left and the secondary structure is on the right, along with its dot-bracket representation and flattened diagram. Image generated using FORNA [9]
Fig. 2 Patterns emerging from a secondary structure. Example of various patterns that can emerge from a secondary structure. Blue nucleotides are part of a hairpin, green ones are part of stems, yellow nucleotides are part of a bulge, brown ones are part of an interior loop
This work proposes a different way to investigate RNA folding: we model it with an algebraic structure and exploit the decomposition of that structure into prime factors.

Brauer monoid
A monoid is an algebraic structure made of a set of elements and an associative binary operator equipped with an identity element. A tangle on 2N points is a set of N edges that pair up 2N dots arranged in two rows: a top row labelled 1, ..., N and a bottom row labelled 1′, ..., N′ (see Fig. 5). We can compose two tangles by identifying the bottom row of the first with the top row of the second and then redrawing the edges accordingly (see Fig. 4). The set of all tangles on 2N points under the composition operator • is called the Brauer Monoid B_N [3].
Edges of the form e = a : b′ are called transversals; when a > b, a < b or a = b we call them positive, negative, and zero transversals respectively. Edges of the form e = a : b or e = a′ : b′ are called upper and lower hooks respectively [6]. The size of an edge e = a : b, with a and b arbitrary dots, is defined as |e| = |a − b|.
B_N is closed under composition and its identity is I_N = {1 : 1′, 2 : 2′, ..., N : N′}. A tangle P is called prime if it can only be written in the form P = I_N • P = P • I_N. There are two types of prime tangles (Fig. 6):
• T-primes T_i (1 ≤ i < N), in which the transversals i : (i+1)′ and (i+1) : i′ cross and every other edge is a zero transversal;
• U-primes U_i (1 ≤ i < N), made of the upper hook i : i+1, the lower hook i′ : (i+1)′ and zero transversals everywhere else.
Note that crossings in a tangle are only introduced by T-primes. T-primes and U-primes are the generators of all tangles in B_N under composition; this means that we can reduce any tangle to a prime factorization. It is useful to note here that factorization in the Brauer Monoid is not unique.
Fig. 3 A pseudoknotted tRNA. Secondary structure of the yeast phenylalanine tRNA along with its dot-bracket representation [1]. The folding forms a pseudoknot because of the G-C pair at positions 18-50 and the G-C pair at positions 14-42. There are three multiloops (coloured in red) at the base of the three stems with hairpins
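Composition can be made concrete with a short Python sketch. The encoding is an assumption made only for illustration and is not taken from the paper: a tangle in B_N is a dict that maps every dot to its partner, with the top dot i written as +i and the bottom dot i′ written as −i. The helper names `identity`, `t_prime`, `u_prime` and `compose` are hypothetical. Closed loops that may form on the glued middle row are simply dropped.

```python
def identity(n):
    """I_N: every top dot i is joined to the bottom dot i'."""
    tangle = {}
    for i in range(1, n + 1):
        tangle[i], tangle[-i] = -i, i
    return tangle

def t_prime(n, i):
    """T_i: the strands at positions i and i+1 cross."""
    tangle = identity(n)
    tangle[i], tangle[-(i + 1)] = -(i + 1), i
    tangle[i + 1], tangle[-i] = -i, i + 1
    return tangle

def u_prime(n, i):
    """U_i: upper hook i : i+1 and lower hook i' : (i+1)'."""
    tangle = identity(n)
    tangle[i], tangle[i + 1] = i + 1, i
    tangle[-i], tangle[-(i + 1)] = -(i + 1), -i
    return tangle

def compose(x, y):
    """x • y: glue the bottom row of x to the top row of y."""
    n = len(x) // 2
    # Walk from every outer dot: the top row of x and the bottom row of y.
    starts = [(i, True) for i in range(1, n + 1)]
    starts += [(-i, False) for i in range(1, n + 1)]
    result = {}
    for start, in_x in starts:
        dot = start
        while True:
            dot = (x if in_x else y)[dot]   # follow the edge in the current tangle
            if in_x == (dot > 0):
                break                       # reached the top of x or the bottom of y
            in_x, dot = not in_x, -dot      # on the middle row: cross to the other tangle
        result[start] = dot
    return result

t1, t3 = t_prime(4, 1), t_prime(4, 3)
print(compose(t1, t1) == identity(4))       # True
print(compose(t1, t3) == compose(t3, t1))   # True
```

The second check illustrates why factorizations are not unique: T_1 • T_3 and T_3 • T_1 are two different factor lists of the same tangle.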
A factor list F for a tangle X is a list of prime tangles of the form P_{x_1} • P_{x_2} • · · · • P_{x_i} whose composition gives back X. The length of a factor list F is denoted by |F|. The factor list F of the identity tangle I_N is the empty list, whose length is |F| = 0.
For each tangle X ∈ B_N, we call the factorization problem the task of finding a factor list of minimal length.

Methods
The first attempt to draw a connection between RNA secondary structures and tangles in the Brauer Monoid was due to Kauffman and Magarshak [12]. Their intuition was that the number of parentheses in an RNA dot-bracket representation and the number of dots in a tangle are always even, and that each open parenthesis must be matched by a closed parenthesis somewhere in the string, just as every dot of a tangle belongs to an edge. Therefore, they provided the following procedure for converting an RNA secondary structure to a tangle:
1. flatten the secondary structure into a single long chain (equivalent to the dot-bracket notation);
2. discard the unpaired nucleotides; there are now 2N nucleotides and N pairs;
3. abbreviate stacked arcs to a single arc; we will call this reduced diagram a shape [10,21];
4. rotate the second half of the shape diagram above the first;
5. enumerate the nucleotides in the top row with numbers in [N] and the nucleotides in the bottom row with numbers in [N]′.
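The five steps above can be sketched in Python as follows. This is an illustrative reading of the procedure, not the authors' implementation: the function names are hypothetical, the bracket alphabet `()[]{}<>` is an assumption, and the folding convention (first half of the shape becomes the bottom row read left to right, second half becomes the top row read right to left) is inferred from the tRNA example discussed later in the paper, where the arc (1, 2N) becomes the edge 1 : 1′.

```python
CLOSERS = {")": "(", "]": "[", "}": "{", ">": "<"}

def parse_arcs(dot_bracket):
    """Step 1: read the base pairs (i, j), i < j, off a dot-bracket string."""
    stacks = {opener: [] for opener in CLOSERS.values()}
    arcs = []
    for pos, symbol in enumerate(dot_bracket):
        if symbol in stacks:
            stacks[symbol].append(pos)
        elif symbol in CLOSERS:
            arcs.append((stacks[CLOSERS[symbol]].pop(), pos))
    return arcs

def renumber(arcs):
    """Relabel the endpoints that are still in use as 0, 1, ..., 2N-1."""
    rank = {p: k for k, p in enumerate(sorted(p for arc in arcs for p in arc))}
    return sorted((rank[i], rank[j]) for i, j in arcs)

def shape(dot_bracket):
    """Steps 2-3: drop unpaired positions, then collapse stacked arcs."""
    arcs = renumber(parse_arcs(dot_bracket))
    while True:
        present = set(arcs)
        # An arc is stacked when the arc immediately around it is also present.
        stacked = {(i, j) for i, j in arcs if (i - 1, j + 1) in present}
        if not stacked:
            return arcs
        arcs = renumber([arc for arc in arcs if arc not in stacked])

def to_tangle(arcs):
    """Steps 4-5: fold the shape in half and label the two rows."""
    n = len(arcs)
    def dot(p):
        # First half -> bottom row (1', ..., N'); second half -> top row, reversed.
        return f"{p + 1}'" if p < n else str(2 * n - p)
    return {f"{dot(i)}:{dot(j)}" if j < n else f"{dot(j)}:{dot(i)}"
            for i, j in arcs}

# A made-up cloverleaf-like string with one pseudoknot (not a real sequence).
example = "((.((..)).((.[[.)).((.]]..)).))"
print(sorted(to_tangle(shape(example))))
```

For this made-up input the sketch returns the edges 1 : 1′, 2′ : 3′, 5 : 4′, 3 : 5′ and 2 : 4, the same edge set that the paper discusses for its tRNA example.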
As Giegerich et al. pointed out, studying the shape of an RNA secondary structure relieves the user of the burden of paying attention to changes that do not affect the overall desired structure; since we are doing a static analysis, we do not lose information [10]. In this context, the procedure described above gives us the opportunity to study the shape of RNA secondary structures in terms of tangles and generators for these tangles. For this purpose, we wrote an algorithm capable of finding the minimal number of prime compositions for any given tangle [16]. We classify tangles in the following way:
• T-tangle: a tangle whose edges are all transversals (it has only T-primes as factors);
• U-tangle: a tangle X = X′ • U_i (X has a lower hook h of size |h| = 1);
• TL-tangle: a U-tangle with the extra condition of having only U-primes as factors (no edge in X intersects another edge; TL stands for Temperley-Lieb, who first described them [23]);
• H-tangle: all the other tangles (H stands for big hook, because they always have a lower hook h of size |h| > 1).
For a visual example see Fig. 7. For each class of tangles, we provide an algorithm for calculating its factorization.

Factoring T-tangles
The set of T-tangles on 2N dots is isomorphic to the symmetric group S_N, therefore we can represent any T-tangle X as a permutation of [N] (each top dot is sent to the bottom dot it is connected to) and we can find an optimal factorization by sorting the bottom row of X. Since every T-prime is equivalent to an adjacent swap, we are limited to O(N^2) algorithms, like BubbleSort.
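A minimal sketch of the BubbleSort idea is given below. The input encoding is an assumption: `bottom_row[j]` is the label of the top dot joined to the bottom dot (j+1)′. Each adjacent swap at positions i and i+1 peels a T_i off the bottom of the tangle, so the factor list is the reverse of the order in which the swaps are performed. The function name is hypothetical.

```python
def factor_t_tangle(bottom_row):
    """Return the indices i of a factor list T_i • ... of a T-tangle."""
    row = list(bottom_row)
    extracted = []
    swapped = True
    while swapped:
        swapped = False
        for j in range(len(row) - 1):
            if row[j] > row[j + 1]:
                row[j], row[j + 1] = row[j + 1], row[j]
                extracted.append(j + 1)   # this swap removes T_{j+1} from the bottom
                swapped = True
    return extracted[::-1]                # factors are read from the top down

# T-tangle with edges 1:1', 2:2', 4:3', 5:4', 3:5' (bottom row reads 1 2 4 5 3).
print(factor_t_tangle([1, 2, 4, 5, 3]))   # [3, 4], i.e. T_3 • T_4
```

BubbleSort performs exactly one swap per inversion of the bottom row, which is why the resulting list has the smallest possible number of adjacent swaps.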

Factoring TL-tangles
Ernst et al. defined a factorization algorithm that constructs a minimal factor list given an input TL-tangle [7]. Their algorithm works by subdividing the tangle into vertical columns and then enumerating all regions of odd depth (called 1-regions) that this subdivision generates. Each region corresponds to a U-prime, and if two regions R_1 and R_2 are diagonally adjacent, with R_1 having a lower depth than R_2, then they write R_1 → R_2, thereby constructing a Directed Acyclic Graph (DAG) of regions. By reading this graph from left to right and from top to bottom, they obtain a minimal factor list. Our implementation of their algorithm takes quadratic time. For a more detailed explanation, the reader can refer to the original paper.

Factoring U-tangles
Recall that a U-tangle is a tangle of the form X = X′ • U_i; we would like to find X′ by removing U_i from X. To do this, we merge the lower hook h = i′ : (i+1)′ with another edge in the tangle. We say that we merge a lower hook h = i′ : (i+1)′ and an edge e = e_1 : e_2 by removing them from X and adding edges a and b such that, if e is a hook or a negative transversal, then a = e_1 : i′ and b = (i+1)′ : e_2, and if e is a positive transversal, then a = e_1 : (i+1)′ and b = e_2 : i′.
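The merge operation can be sketched as follows. The encoding is an assumption made for illustration: top dot i is the integer +i, bottom dot i′ is −i, and a tangle is a collection of two-element edges. The function name `merge` and the convention that a transversal is passed as (top dot, bottom dot) are hypothetical choices.

```python
def merge(edges, i, e):
    """Merge the lower hook i' : (i+1)' with the edge e = (e1, e2)."""
    e1, e2 = e
    hook = frozenset((-i, -(i + 1)))
    # Positive transversal: top dot a joined to bottom dot b' with a > b.
    positive = e1 > 0 > e2 and e1 > -e2
    if positive:
        new_edges = [(e1, -(i + 1)), (e2, -i)]
    else:  # hook or negative transversal
        new_edges = [(e1, -i), (-(i + 1), e2)]
    remaining = {frozenset(edge) for edge in edges} - {hook, frozenset(e)}
    return remaining | {frozenset(edge) for edge in new_edges}

# Tangle with edges 1:1', 2':3', 5:4', 3:5' and 2:4 (the tRNA example).
x = [(1, -1), (-2, -3), (5, -4), (3, -5), (2, 4)]
x_prime = merge(x, 2, (2, 4))   # merge the lower hook 2':3' with the upper hook 2:4
print(sorted(tuple(sorted(edge)) for edge in x_prime))
```

On this input the upper hook 2 : 4 is replaced by the transversals 2 : 2′ and 4 : 3′, so the result has no hooks left and is a T-tangle.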
Since the number of crossings in a tangle corresponds to the number of T-primes in its factor list, we would like this merging process to keep the crossing number constant; in this way we are sure not to introduce any additional T-primes into the (possibly non-optimal) factor list we are calculating.

Heuristic 1
Given a U-tangle X with c crossings and with a lower hook h = i′ : (i+1)′, let I = {i : i′, (i+1) : (i+1)′} be the set of the two imaginary edges above the endpoints of h. For all edges e ≠ h, calculate inter(e), the number of intersections e has with edges in I. Let S = {e : e ∈ X, inter(e) = 2} be the set of edges that intersect both edges in I. For each e ∈ S, calculate the number of crossings the tangle X′ would have if we merged h with e, and pick the tangle whose number of crossings is equal to c. If more than one edge satisfies this last condition, pick among them the edge that has the fewest intersections in X.
Note that some edges in X may share a dot with an edge in I. We count these as intersecting too.
Merging two edges takes constant time, but the calculation of the crossing number takes O(N^2) [24], and since we have to merge h with N edges in the worst case, the time complexity of this heuristic is O(N^3).
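One simple way to obtain a quadratic crossing count is a pairwise check, sketched below. This is not necessarily the method of [24]: it is an assumption that unfolds the tangle back onto a line (bottom dot i′ at position i, top dot i at position 2N + 1 − i, matching the folding used in the paper's tRNA example) and counts the pairs of arcs whose endpoints interleave. Top dots are encoded as +i and bottom dots as −i; the function names are hypothetical.

```python
from itertools import combinations

def position(dot, n):
    """Unfold a tangle dot onto the line 1, ..., 2N."""
    return -dot if dot < 0 else 2 * n + 1 - dot

def crossing_number(edges, n):
    """Count the pairs of edges that cross, in O(N^2) time."""
    arcs = [tuple(sorted((position(a, n), position(b, n)))) for a, b in edges]
    return sum(1 for (a, b), (c, d) in combinations(arcs, 2)
               if a < c < b < d or c < a < d < b)

# Tangle with edges 1:1', 2':3', 5:4', 3:5' and 2:4.
x = [(1, -1), (-2, -3), (5, -4), (3, -5), (2, 4)]
print(crossing_number(x, 5))   # 2
```

The two crossings found here agree with the two T-primes in the factorization T_3 • T_4 • U_2 that the paper derives for this tangle.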

Factoring H-tangles
We will extract factors from an H-tangle X by transforming it into a U-tangle. The idea is to take one of the lower hooks h with size |h| > 1 and shrink it until it has size one. To do this we compose X with T-primes until this condition is met. During the shrinkage process, other edges will inevitably change size. In order to decide where we should shrink h, we use a heuristic that chooses a location where the size of the other edges increases the least. We apply this heuristic to the smallest lower hook of X, so that there is no smaller lower hook inside it.

Heuristic 2
Given an H-tangle X, let h = i′ : (i+k)′ be the smallest lower hook of X, of size k > 1. Let j be the index of the shrinkage location where the size of the other edges increases the least. Shrink the lower hook h into location j by composing X with a factor list F of T-primes, and yield F^{−1}, where the notation F^{−1} indicates the reverse of the factor list F. This heuristic is not optimal, but it can be computed in linear time.

Minimal factorization
The heuristics mentioned above do not always yield a minimal factorization, therefore a minimization step is required. It turns out that prime tangles follow a particular set of rules (see Table 1) [13]. We call R1-10 delete rules and R11-13 move rules. We can use them to minimize a non-optimal factor list by implementing them in a rewriting logic tool (we chose the Maude System [5,15]).

From RNA to tangle factorization
We will now provide an example of the mapping procedure for deriving, from an RNA secondary structure, a tangle with its prime factors. We start from the modified E. coli tRNA in Fig. 8 [8], and apply Kauffman and Magarshak's mapping to obtain the flattened diagram in Fig. 9a.
This diagram is reduced to obtain a shape diagram (Fig. 9b) that can be folded to get the corresponding tangle (Fig. 9c). We can now factorize it by using the methods discussed previously. Figure 10 shows the four steps of the factorization algorithm:
(a) The algorithm recognizes that X is a U-tangle because there is a lower hook of size 1 (2′ : 3′). Therefore it can be rewritten as X = X′ • U_2. The algorithm applies Heuristic 1, which determines that the upper hook 2 : 4 is the only edge intersecting both imaginary edges (the two vertical dotted lines). Therefore the lower hook 2′ : 3′ and the upper hook 2 : 4 are merged and we obtain the tangle X′. The prime U_2 is yielded and the algorithm moves to the next step.
(b) The rewritten tangle X′ is a T-tangle. The algorithm applies BubbleSort, which first extracts T_4, thus shrinking the edge 3 : 5′ to 3 : 4′ and obtaining X′′.
(c) BubbleSort applies one more swap, which corresponds to a T_3 and delivers X′′′.
(d) The algorithm has now reached the identity tangle (X′′′) and the first part of the factorization process has terminated.
Thus the yielded factorization is T_3 • T_4 • U_2. Now the algorithm moves to the rewriting logic step, whose aim is to ensure that this is the minimal factorization and, if it is not, to find a better one. Since there is no move rule that can lead to the application of a delete rule, the algorithm concludes that this factor list is minimal (Fig. 11c).
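As a sanity check, the factor list can be composed back into a tangle. The sketch below is an illustration under assumptions that are not part of the paper: a tangle is encoded as a dict from each dot to its partner (top dot i as +i, bottom dot i′ as −i), X • Y places X above Y, and the helper names `prime`, `compose` and `edges` are hypothetical.

```python
from functools import reduce

def prime(kind, i, n):
    """Partner map of T_i (kind "T") or U_i (kind "U") in B_n."""
    p = {}
    for k in range(1, n + 1):
        p[k], p[-k] = -k, k
    if kind == "T":
        p[i], p[-(i + 1)], p[i + 1], p[-i] = -(i + 1), i, -i, i + 1
    else:
        p[i], p[i + 1], p[-i], p[-(i + 1)] = i + 1, i, -(i + 1), -i
    return p

def compose(x, y):
    """x • y: glue the bottom row of x to the top row of y."""
    n = len(x) // 2
    starts = [(i, True) for i in range(1, n + 1)]
    starts += [(-i, False) for i in range(1, n + 1)]
    result = {}
    for start, in_x in starts:
        dot = start
        while True:
            dot = (x if in_x else y)[dot]   # follow the edge in the current tangle
            if in_x == (dot > 0):
                break                       # reached an outer row
            in_x, dot = not in_x, -dot      # cross the glued middle row
        result[start] = dot
    return result

def edges(tangle):
    """Edge set of a partner map, each edge written with the larger dot first."""
    return {tuple(sorted(pair, reverse=True)) for pair in tangle.items()}

factors = [prime("T", 3, 5), prime("T", 4, 5), prime("U", 2, 5)]
x = reduce(compose, factors)
# Edges 1:1', 2':3', 5:4', 3:5' and 2:4 of the tangle in the example.
print(edges(x) == {(1, -1), (-2, -3), (5, -4), (3, -5), (4, 2)})   # True
```

Swapping the last two factors, T_3 • U_2 • T_4, gives the same tangle, since T_4 and U_2 act on disjoint strands.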
An online interactive demo that calculates these steps automatically is available [17].
Figure 12a is an example of an RNA molecule that does not have any pseudoknots, therefore its corresponding tangle will not have any crossings. This implies that it will be mapped to a TL-tangle, which we know can be factorized using Ernst's algorithm. To obtain the corresponding tangle we apply Kauffman and Magarshak's mapping. We take its secondary structure (represented as a flattened diagram in Fig. 12b) and reduce it to a shape diagram (Fig. 12c). The shape diagram can now be folded in half to obtain the tangle in Fig. 13a. We then apply Ernst's algorithm by dividing it into five columns (Fig. 13b), i.e. by drawing imaginary edges that connect each upper dot to its corresponding bottom dot, and selecting in each column the regions of odd depth (Fig. 13c). We then build the DAG by connecting two regions R_1 and R_2 if they are diagonally adjacent and R_1 is above R_2 (Fig. 13d). Each node now corresponds to a region, and each edge indicates that two regions are diagonally adjacent. We then read the graph nodes from top to bottom and from left to right. If a node is in column i, then we write in output the prime tangle U_i (Fig. 13e).

RNA with pseudoknots
Suppose we have a complex RNA secondary structure that yields the tangle in Fig. 14a. Since it is an H-tangle, for this example our algorithm applies Heuristic 2 to the smallest lower hook (in this case there is only one, namely 2′ : 7′). To choose where we should shrink this lower hook, the algorithm calculates which shrinkage location increases the size of the other edges the least (Table 2).
Since in this case the heuristic found two best locations, 1 and 2, it randomly chooses location number 1. Therefore 2′ : 7′ will be shrunk to a lower hook 2′ : 3′ and the factorization yielded so far is T_3 • T_4 • T_5 • T_6, the reverse of the factorization for this location (we record the reverse because, if we need to shrink the lower hook during factorization, we need to expand it during composition). The algorithm now tries to factorize the tangle returned from the last step (Fig. 14b). Since it is a U-tangle, the algorithm applies Heuristic 1. It selects the lower hook 2′ : 3′ and checks which edges intersect the imaginary edges in the set I = {2 : 2′, 3 : 3′}. The only edge intersecting both is 4 : 1′, therefore 2′ : 3′ and 4 : 1′ are merged. This step returns the tangle in Fig. 14c and yields the prime factor U_2. Since the tangle in Fig. 14c is still a U-tangle, the same step is applied again, returning the T-tangle in Fig. 14d and yielding the prime factor U_1. This last tangle can be factorized optimally by applying Algorithm 1, which yields the factorization T_3 • T_5 • T_6 • T_5. Its last step returns the identity tangle, therefore the algorithm stops. The resulting factorization is minimal, therefore the reduction step is not necessary.
Table 2 For each edge inside the lower hook 2′ : 7′, this table reports how much the edge would increase (or decrease) in size if the algorithm shrunk the lower hook 2′ : 7′ into each of the shrinkage locations from 1 to 5. The best shrinkage location is selected among those that have the minimal sum of these sizes (1 and 2 in this case).
Reduction of a non-minimal factorization
Suppose a non-minimal factorization term is given. The rewriting logic step minimizes the term by performing a sequence of rewrites using the rules presented in Table 1.
In the last step there are no delete rules applicable and no move rules that eventually lead to a delete. Therefore this factor list is minimal.

Results
The resulting tangle is invariant to synonymous mutations, which are mutations that do not change the secondary structure. This is due to the fact that we discard unpaired nucleotides and abbreviate stacked arcs, allowing multiple secondary structures to map to the same factorization. This also allows researchers to move their attention to patterns in the factorizations of their desired shapes. A less obvious result (already observed by Kauffman and Magarshak) is that every secondary structure without pseudoknots maps to a TL-tangle. The intuition behind this result is that the number of valid ways we can arrange 2N open and closed parentheses of a single type is the Catalan number
\[ C_N = \frac{1}{N+1}\binom{2N}{N}, \]
which is exactly the number of tangles with non-crossing edges in B_N [4,23]. This also implies that every pseudoknotted secondary structure corresponds to a tangle with at least one crossing, and thus at least one T-prime as a factor.
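The counting argument can be checked by brute force for small N. The sketch below enumerates every way of pairing up 2N points on a line (one pairing per tangle of B_N), keeps the pairings with no interleaving arcs, and compares the count with the closed formula for the Catalan number. The function names are hypothetical and the enumeration is only practical for small N.

```python
from math import comb

def matchings(points):
    """Yield every perfect matching of an even-length tuple of points."""
    if not points:
        yield []
        return
    first, rest = points[0], points[1:]
    for k, partner in enumerate(rest):
        for tail in matchings(rest[:k] + rest[k + 1:]):
            yield [(first, partner)] + tail

def is_noncrossing(matching):
    """True when no two arcs (a, b) and (c, d) interleave as a < c < b < d."""
    return not any(a < c < b < d for a, b in matching for c, d in matching)

def count_noncrossing(n):
    return sum(is_noncrossing(m) for m in matchings(tuple(range(2 * n))))

def catalan(n):
    return comb(2 * n, n) // (n + 1)

for n in range(1, 6):
    print(n, count_noncrossing(n), catalan(n))
```

For N = 3, for instance, 5 of the 15 pairings are non-crossing, which is the third Catalan number.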
Let us show some other properties using the example we provided in the previous section (Fig. 11). In the corresponding tangle, only stems and pseudoknots are visible and they are encoded in the factorization. Starting from stem s_1, six pairs are identified with the unique vertical edge, which does not have corresponding factors. Its presence, however, causes the indexes of the prime tangles to be shifted by one (Proposition 1). The three pairs of the stem s_2 correspond to the 2′ : 3′ arc generated by the factor U_2. The stem s_3, corresponding to the edge 5 : 4′, is generated by T_4. This is because its two endpoints were situated in the first and second half of the flattened secondary structure, causing it to be represented as a diagonal edge. The stem s_4, identified with the edge 2 : 4, is generated by T_3 • U_2 (note that T_4 and U_2 can commute, see "Discussion" section). Lastly, the pseudoknots are identified with the edge 3 : 5′ generated by T_3 and T_4, which are the factors in common with the edges that it crosses, 2 : 4 and 5 : 4′ (Proposition 3).
We will give a mathematical foundation for these empirical results. Given a section s of an RNA secondary structure, stem or pseudoknot, we write edge(s) = (i, j) to denote its corresponding edge in the RNA shape (or tangle) beginning in position i and ending in position j (with i < j ). Given a tangle X and an edge e ∈ X , we will write gen(e) to indicate the factors that generate it.

Proposition 1
If an RNA secondary structure has a stem s with edge(s) = (1, 2N), then the index of every factor of the corresponding tangle X ∈ B_N will always be greater than or equal to two. The converse is also true.
Proof Assume that an RNA shape has an edge e = (1, 2N). Let X be the corresponding tangle; then 1 : 1′ ∈ X and therefore there is no prime T_1 or U_1 in the factorization of X. The backward argument is also valid.

Proposition 2
Let s be a stem of an RNA secondary structure and let p be a pseudoknot starting inside the hairpin of s and ending outside of it. Then edge(s) will cross edge(p).
Proof We can abstract edge(s) as a 2-dimensional closed curve S ⊂ ℝ² by closing its two ends with a horizontal line. We then have that edge(p) starts inside of S and ends outside of it. By the Jordan Curve Theorem on ℝ² we know that edge(p) must cross S, and since we assume that in the shape diagram all edges are situated in the upper portion of the diagram, we know that edge(p) must cross edge(s).

Proposition 3
Let X be a tangle with e_1, e_2 ∈ X and let G = gen(e_1) ∩ gen(e_2). If e_1 and e_2 cross, then there exists T_i ∈ G for some i.
Proof Since e_1 and e_2 cross, they must share a prime tangle P that generates their crossing. But since every intersection is generated by a T-prime, P must be a T-prime. This implies that a T-prime generates both e_1 and e_2.

Discussion
The existence of equivalent factorizations leads us to reason about an open question:
Open Question What is the biological interpretation of commutative factors and, in general, of equivalent factorizations?
We hypothesise two separate research directions, regarding:
• equivalent factorizations up to commutativity (R13)
• equivalent factorizations up to R11 and R12
The reason for this distinction is that R13 does not really pose a challenge during factorization: R13 only lets two primes P_i and P_j commute, so the number of prime factors P_i and P_j remains unchanged. In R11 and R12, instead, the two sides differ: the number of T_i's is two on the left side and one on the right for R11, and for R12 the left side and the right side do not even share a common factor. Since the factorizations yielded by R11 and R12 are fundamentally different, we think that they have a different biological interpretation than R13.
We can also discuss another research direction by analyzing different mappings from RNA secondary structures to tangles. For example, in the mapping we discussed in this paper, if there is a pseudoknot p connecting stems s_1 and s_2 then in the corresponding tangle there will be three edges, one for each of them. In this framework, the interaction between two stems is represented by an edge intersecting their corresponding edges. We could, instead, think of another mapping in which stems connected by a pseudoknot have corresponding edges that cross each other (Fig. 15).
We did not explore this alternative mapping, so we leave it as a future research direction.
Regarding the factorization algorithm, there are also some improvements that can be made with respect to the time complexity. Our methodology uses heuristics to obtain a non-minimal factorization and then refines it by using rewriting logic. This last step becomes prohibitive for large tangles, therefore a faster approach is necessary. During our research, we did not find an algorithm with such performance, but we conjecture that the factorization problem for the Brauer Monoid could be solved in polynomial time.
Let us now discuss some practical applications of our methodology.
The factor representation we have discussed in this paper can be useful as an additional classification criterion for databases of RNA secondary structures, in which a user could query RNAs that are generated only by a particular set of prime tangles, without needing to specify the exact shape of the RNA molecule they are interested in. This could also lead to interesting applications in the context of sequence alignment, in which two sequences are compared not by the alignment of their nucleotides, but by their factor lists.
R13. P_i • P_j = P_j • P_i ⇐⇒ |i − j| > 1
Fig. 15 Two different mappings. Two mappings in which pseudoknots are treated differently. s_1 and s_2 are two stems and p is a pseudoknot connecting them. (a) The mapping that Kauffman and Magarshak proposed. (b) Another mapping in which the pseudoknot corresponds to the intersection between s_1 and s_2 (grey dot)